Introduction
We set out to find an untapped market by creating a popular beer in a state that is target rich for a new brewery. Given a data base that houses #information about all the beers and breweries in the United States we will make key investment recommendations to Budweiser. We will recommend a #general location for this new Brewery and we will recommend the style of beer for the market we are entering. Finally, we will be specific about the #optimal quantities of both Alcohol by Volume (ABV) and International Bitterness Units (IBU) for our new flagship Beer.
Beers <- read.csv("C:/Users/Tricia/Desktop/SMU/DS 6306/MSDS_6306_Doing-Data-Science-Master/Unit 8 and 9 Case Study 1/Beers.csv", stringsAsFactors = TRUE, na.strings='..')
Breweries <- read.csv("C:/Users/Tricia/Desktop/SMU/DS 6306/MSDS_6306_Doing-Data-Science-Master/Unit 8 and 9 Case Study 1/Breweries.csv", stringsAsFactors = TRUE, na.strings='..')
#In order to start the project the files were merged together by utilizing a common column 'Brewery ID'.
Brews <- merge(Beers, Breweries, by.x = "Brewery_id", by.y = "Brew_ID", all = TRUE)
#Change variable names
colnames(Brews) <- c('Brewery_id','Beer','Beer_ID','ABV','IBU','Style',
'Ounces','Brewery','City','State')
#Ensure files are merged and variable names are correct
#Create a data frame for Brews.
df1 <-Brews %>%
count(State, sort = TRUE, name = "Breweries")
#Create a plot to demonstrate the number of breweries per state.
ggplot(Brews, aes(x = State)) +
geom_bar(stat = "count" , fill = "66a182") + ggtitle("Brewery Count by State") +
theme(axis.text.x=element_text(size=rel(0.8), angle=45)) + labs(x="State",y="Count")
MissingValues <- sapply(Brews, function(MissingValue)sum(is.na(MissingValue)))
view(MissingValues)
## Create a column chart of NA counts to determine how many NA values there are in the dataset
na_table <-data.frame(
na_col = c('Brewery_id','Beer','Beer_ID','ABV','IBU','Style',
'Ounces','Brewery','City','State'),
na_cnt = colSums(is.na(Brews))
)
ggplot(na_table) +
geom_col(aes(x=reorder(na_col, -na_cnt), y=na_cnt),fill = "plum1") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x="Column Name", y="NA Counts")
#Instead of sampling the column, we can impute columns by taking the mean of their brewery ID.
#In the absence of # brewery ID mean, we take the overall column mean.
Brews$ABV <- ifelse(is.na(ave(Brews$ABV,Brews$Brewery_id,FUN=function(x)
ifelse(is.na(x), mean(x,na.rm=TRUE), x))),mean(Brews$ABV, na.rm = TRUE),
ave(Brews$ABV,Brews$Brewery_id,FUN=function(x)
ifelse(is.na(x), mean(x,na.rm=TRUE), x)))
Brews$IBU <- ifelse(is.na(ave(Brews$IBU,Brews$Brewery_id,FUN=function(x)
ifelse(is.na(x), mean(x,na.rm=TRUE), x))),mean(Brews$IBU, na.rm = TRUE),
ave(Brews$IBU,Brews$Brewery_id,FUN=function(x)
ifelse(is.na(x), mean(x,na.rm=TRUE), x)))
## Remove all NA (Null) values None should be left after imputation.
Brews2 <- Brews[complete.cases(Brews),]
na_table <-data.frame(
na_col = c('Brewery_id','Name.x','Beer_ID','ABV','IBU','Style',
'Ounces','Name.y','City','State'),
na_cnt = colSums(is.na(Brews2))
)
ggplot(na_table) +
geom_col(aes(x=reorder(na_col, -na_cnt), y=na_cnt),fill = "blue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x="Column Name", y="NA Counts")
#Utilizing the updated data 'Brews2' we can now search for the median ABV and median IBU by grouping by state.
# Median of ABV
MedABV <- Brews2 %>%
group_by(State) %>%
summarise_each(funs(median), Med_ABV = ABV)
## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## Please use `across()` instead.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#Plot of Median ABV by State
ggplot(data=MedABV, aes(x=State, y=Med_ABV)) +
geom_bar(stat="identity", fill="cyan3")+ ggtitle("Median ABV by State") +
theme(axis.text.x=element_text(size=rel(0.8), angle=45)) + labs(x="State",y="ABV")
# Median of IBV
MedIBU <- Brews2 %>%
group_by(State) %>%
summarise_each(funs(median), Med_IBU = IBU)
#Plot of Median ABV by State
ggplot(data=MedIBU, aes(x=State, y=Med_IBU)) +
geom_bar(stat="identity", fill="pink")+ ggtitle("Median IBU by State") +
theme(axis.text.x=element_text(size=rel(0.8), angle=45)) + labs(x="State",y="IBU")
5a. Which state has the maximum alcoholic (ABV) beer? Colorado has the max ABV beer. Colorado’s Lee Hill Series vol.5-Belgian Style Quadrupel Ale has the highest Alcohol by Volume (ABV), at 12.8%
#We are only looking for one answer but this code is taking the top 5
#by using the top_n function. In order to see the information, by scrolling
#over the data point, a ggplot was stored under the object p and then
#ggplotly was applied to the object ‘p’
ABV = top_n(Brews2, 5, ABV)
p <- ggplot(data = ABV) + geom_point(mapping=aes(x=ABV, y=ABV, color=State))
ggplotly(p)
5b. Which state has the most bitter (IBU) beer? Oregon has the most bitter beer. Oregon’s Bitter Bitch Imperial IPA has the highest International Bitterness Units (IBU), at 138.
#The same principle as 5a was applied to the Category IBU
IBU = top_n(Brews2, 5, IBU)
p <- ggplot(data = IBU) + geom_point(mapping=aes(x=IBU, y=IBU, color=State))
ggplotly(p)
#ggplot was used to get an image of each States ABV distribution. This
#was done through the use of box plots which provided basic statistics
#(mean, median, outliers, skewedness) to help conduct an Exploratory Data Analysis (EDA).
#The summary function provides the same data but in numeric format and as a nation, not by State.
ggplot(data=Brews2, mapping=aes(x=reorder(State,ABV), y=ABV ))+geom_boxplot(color="black", fill="green", alpha=0.2) + geom_jitter(width = 0.15) +
theme(axis.text.x=element_text(size=rel(0.8), angle=45)) + ggtitle("Distribution of ABV Across All States") + labs(x="State",y="ABV")
summary(Brews2$ABV)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00100 0.05000 0.05700 0.05981 0.06700 0.12800
summary(Brews2$IBU)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 28.00 42.71 44.12 55.00 138.00
#A scatter plot was created in ggplot, using the numeric categories ABV & IBU. In order to
# see if there is a relationship between ABV & IBU a linear regression was added
#in the form of the geom_smooth function.
p <- ggplot(data=Brews2, mapping=aes(x=ABV, y=IBU))+geom_point()+
geom_smooth()+ggtitle("RELATIONSHIP BETWEEN ABV AND IBU")
ggplotly(p)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#We wanted to diplay on a map where are target state was so we loaded maps data
#and plotted New Mexico's coordinates
usa = map_data("usa")
s <- ggplot() +
geom_polygon(data = usa, aes(x = long, y = lat, group = group), fill = "white", color = "red") +
coord_quickmap()
#New Mexico Coordinates
NewMexico <- tibble(
long = c(-105.8701),
lat = c(34.519934),
names = c("New Mexico"))
s + geom_point(data = NewMexico, aes(x = long, y = lat), shape = 21, color = "black", fill = "white", size = 5) +
geom_text(data = NewMexico, aes(x = long, y = lat, label = names), hjust = 0, nudge_x = 1, color = "black")
83% chance that the beer is an IPA at K=5 and an 82.6% chance that the beer is an IPA at K=15
##A new column was created that contains 'IPA, ALE, Other.' The grepl function was used in conjunction with an if/else statement. The desired word was isolated using ‘\\b’ which creates boundaries around
#a word so that only the letters within the boundaries were pulled. If the string did not contain IPA or Ale, it was given the name 'other.' 'Other' indicates that the beer will not be used in the analysis.
#The new data was then transformed into a data frame so it can be displayed through ggplot. In this case a scatter plot was created which will serve as the set up to conducting a KNN classification
Brews2$ALE = ifelse(grepl("\\bIPA\\b", Brews2$Style,ignore.case=T),"IPA",
ifelse(grepl("\\bAle\\b",Brews2$Style, ignore.case=T), "Ale","Other"))
Brews3 = data.frame(Brews2[- grep("Other", Brews2$ALE),])
Brews3%>%ggplot(aes(x=ABV, y=IBU, color=ALE))+geom_point()+ggtitle("WHICH TYPE OF ALE IS IT LIKELY TO BE")
8a. KNN classification to investigate this relationship. Provide statistical evidence one way or the other.
#Conducting a KNN requires using the class, caret and e1071 packages
# KNN start with a reference point and a question. 'Given an ABV of .062 and IBU of 62.5, what are the odds that the beer is an IPA or Ale
#The answer is based on looking at both 5 and 15 of the nearest points to our reference point.
NM = data.frame(ABV=.062, IBU=62.5)
knn(Brews3[,c(3,4)], NM, Brews3$ALE, k=5,prob=TRUE)
## [1] IPA
## attr(,"prob")
## [1] 0.6
## Levels: Ale IPA
knn(Brews3[,c(3,4)], NM, Brews3$ALE, k=15,prob=TRUE)
## [1] Ale
## attr(,"prob")
## [1] 0.6
## Levels: Ale IPA
8b. Other methods or techniques you have learned to investigate If you run a Naïve Bayes on the most popular ABV (.065) & the most popular IBU (65), there is a 60% chance the beer will be an IPA. If you run a Naïve Bayes on the 2nd most popular ABV (.07) and the 2nd most popular IBU (70), there is a 78% chance it will be an IPA. If we want to break records with our limited edition and compete with both Oregon (138 IBU) and Colorado (12.8% Alcohol by Volume), according to Naïve Bayes, we should DEFINITELY CREATE AN IPA.
# We isolated beers we want to assess by looking at the ones that are greater than our reference beer
#Filter Beers to ABV >= .062 and IBU >=62.5 and named the object Brews_Great.
#Brews3
Brews_Great = filter(Brews3, ABV >= .062 & IBU >= 62.5)
##The next step in preparing for another statistical study is to see the most popular ABV’s
#We did this by counting the number of Beers that have a specific ABV
#To plot the Top 10 in Bar Graph, we will need the data.table package to help speed up the counting of ABV.
#For a more coherent table, we changed the column header to 'Count' so we know what the column refers to. We then took the top 10 most popular ABV and visualized them through a ggplot bar graph.
library(data.table)
ABV_Count = data.frame(table(Brews_Great$ABV))
colnames(ABV_Count) = c("ABV","Count")
Top_ABV_Count = top_n(ABV_Count, 10, Count)
ggplot(data=Top_ABV_Count)+geom_bar(mapping=aes(x=ABV, y=Count, fill=ABV), stat="identity")+
ggtitle("Most Popular ABV") + theme(axis.text.x=element_text(size=rel(0.8), angle=45)) + labs(x="ABV",y="Count")
#The same method was applied to IBU so that we have a clear look at the most popular ABV and IBU.
#The visual will only show the top 10 most popular ABV & IBU since popularity is requirement for our test.
IBU_Count = data.frame(table(Brews_Great$IBU))
colnames(IBU_Count) = c("IBU","Count")
Top_IBU_Count = top_n(IBU_Count, 10, Count)
ggplot(data=Top_IBU_Count)+geom_bar(mapping=aes(x=IBU, y=Count, fill=IBU), stat="identity")+
ggtitle("Most Popular IBU") + theme(axis.text.x=element_text(size=rel(0.8), angle=45)) + labs(x="IBU",y="Count")
#The set-up is now complete and we are ready to conduct our second statistical analysis.
#We are conducting a Naive Bayes, using the most popular ABV and IBU to answer the following question; given the most popular ABV and IBU, what are #the odds that the beer will be an IPA or Ale?
model = naiveBayes(ALE~ABV+IBU,data=Brews3)
test = data.frame(ABV=.065, IBU=65)
predict(model,test, type="raw")
## Ale IPA
## [1,] 0.405742 0.594258
#We counted the style of all beers for all of the country grouped by beer style
#we learned that the American IPA is the most popular style of beer in the country.
style_all <- ggplot(data=Brews2, mapping=aes(x=Style))+geom_bar(stat = "count")+
theme(axis.text.x=element_text(size=rel(0.6), angle=90)) + ggtitle("Style Count") + labs(x="Style",y="Count")
style_all
#In attempt was made to generate a filtered table of just New Mexico's styles beer
#but the code as not working as expected.
NM_Style = filter(Brews, State == 'NM')
NM_Style
## [1] Brewery_id Beer Beer_ID ABV IBU Style
## [7] Ounces Brewery City State
## <0 rows> (or 0-length row.names)
#create a scatter plot with x y of the specific style to determine a type of beer
americanIPA <-filter(Brews, Style == 'American IPA')
ABV_IBU <- ggplot(data=americanIPA, mapping=aes(x=ABV, y=IBU, color = IBU))+geom_point()+
theme(axis.text.x=element_text(size=rel(0.8), angle=45)) + ggtitle("American IPA ABV v. IBU") + labs(x="ABV",y="IBU")
ABV_IBU2 <-ABV_IBU+scale_color_gradientn(colours = rainbow(5))
ggplotly(ABV_IBU2)
Summary
#Plot target state with population gradient
plot_usmap(data = countypop, values = "pop_2015", include = c("NM"), color = "red") +
scale_fill_continuous(low = "white", high = "red", name = "Population", label = scales::comma) +
labs(title = "New Mexico Region") +
theme(legend.position = "right")
Our linear relationship between ABV and IBU offers an opportunity to increase sales by choosing a state who likes a high ABV and IBU. Our company has high hopes for the state of New Mexico. The state has only 14 breweries, making up less than 1% of the country’s breweries, and enjoys the taste of bitter beer and high alcohol content. After choosing a state to tap into, we then investigated the differences between IBU and ABV between IPAs and other types of ALE. To investigate this relationship, we used a KNN classification tool. Over 80% of the time, our beer will be an IPA. According to that prediction, IPAs are much more popular than ales. Afterward, we wondered how much ABV and IBU people in this country like in their beer? We discovered that there is a preference in this country for an ABV of 6.5% and an IBU of 65. Once we have determined the equilibrium point, we studied New Mexico’s breweries for their styles of beer. A variety of styles of beer were available; the American IPA was popular, but accounted for less than 30%. Having researched all the styles of beers across the country, we were pleasantly surprised to discover that the classic American IPA was one of the most popular beer styles. Finally, we wanted to find an American IPA that Budweiser could use as a benchmark. To ensure this beer met the country’s preference, we made sure it was in the balance zone. Our beer review found that 98 Problems (Cuz A Hop Ain’t One) by Perrin Brewing Company has an ABV of 6.5% (> ABV median) and an IBU of 65 (> IBU median). With an ABV and IBU that are exactly right, this beer is the perfect one for Budweiser to review as it seeks to develop its own IPA.